Learning in networks of binary synapses is known to be an NP-complete problem. A combined
stochastic local search strategy in the synaptic weight space is constructed to further improve the
learning performance of a single random walker. We apply two correlated random walkers, guided
by their Hamming distance and associated energy costs (the number of unlearned patterns), to learn
the same large set of patterns. Each walker first learns a small part of the whole pattern set (partially
different for the two walkers but containing the same number of patterns), and then both walkers explore
their respective weight spaces cooperatively to find a solution that classifies the whole pattern set correctly.

The desired solutions are located in the common parts of the weight spaces explored by these two walkers.
The efficiency of this combined strategy is supported by our extensive numerical simulations, and
the typical Hamming distance as well as the energy cost is estimated by an annealed computation.
The binary synaptic weight is more robust to noise and much simpler for large-scale electronic implementations
than its continuous counterpart. However, learning in networks of binary synapses is known to be an
NP-complete problem. This means that if one could find an algorithm that solves this problem in polynomial time,
any other NP problem (NP being the set of decision problems that can be solved in polynomial time on a
non-deterministic Turing machine) could also be solved in polynomial time. For a single-layered feed-forward
network with binary weights (we call this network the binary perceptron), the storage capacity was predicted to be
$\alpha_s \simeq 0.83$ [2] in the limit where the number of weights $N$ tends to infinity. Here we define the ratio
of the number of patterns $P$ to $N$ as the constraint density $\alpha$, and the theoretical limit of $\alpha$ for
the perceptron is termed the storage capacity. The binary perceptron attempts to perform a random classification
of $\alpha N$ random input patterns [3]. Many efforts have been devoted to the nontrivial algorithmic issue of this
difficult problem. For all finite $\alpha$, a discontinuous ergodicity breaking transition for the binary perceptron
at finite temperature is predicted by the dynamic mean field theory. When the transition temperature is approached,
the traditional simulated annealing process is easily trapped by suboptimal configurations in which a finite
fraction of patterns are still not learned. The difficulty of local search heuristics for learning is likely to be
connected to the fact that exponentially many small clusters coexist in the weight space with an exponentially
larger number of suboptimal configurations [10]. Here, we define a connected component of the weight space as a
cluster of solutions in which any two solutions are connected by a path of consecutive single-weight flips [13].
A configuration of synaptic weights is identified as a solution if it is able to learn a prescribed set of patterns.
Various stochastic local search strategies based on random walks have been used to find solutions of constraint
satisfaction problems [13]-[19]. In our previous study [10], we suggested a simple sequential learning mechanism,
namely synaptic weight space random walking, for the perceptronal learning problem. In this setting, $\alpha N$
patterns are presented in a randomly permuted sequential order, and a random walk of single- or double-weight flips
is performed until each newly added pattern is correctly classified (learned). The previously learned patterns are
not allowed to be misclassified in later stages of the learning process. This simple sequential learning rule was
shown to perform well on networks of $N \sim 10^{3}$ synapses or less. The mean achieved $\alpha$ is 0.57 for
$N = 201$ and 0.41 for $N = 1001$.

In this work, we improve the learning performance by introducing a smart combined strategy. Instead of using
a single random walker, we apply two correlated random walkers guided by their Hamming distance and associated
energy costs. Both walkers are expected to learn the same large set of patterns, but each walker first learns a
small part of the whole pattern set (partially different for the two walkers but with the same constraint density).
Then both walkers explore their respective current weight spaces cooperatively until either of them finds a
solution that classifies the whole pattern set correctly. Therefore, each walker only needs to learn the residual
part of the whole pattern set. Notice that the weight spaces the two walkers explore separately are actually
different, since the small sets of patterns they have learned are not completely identical despite containing the
same number of patterns. If a solution exists, it should belong to one of the common parts of the weight spaces
explored by both walkers; therefore, a small Hamming distance between the two walkers and zero energy are favored
during the multiple random walkings. In fact, the common parts will appear as independent clusters once the larger
expected $\alpha$ is finally reached (see Fig. 1(b)).
The binary perceptron realizes a random classification of $P$ random input patterns (see Fig. 1(a)). To be more
FIG. 1: (Color online) Sketch of the binary perceptron and of the multiple random walkings in the weight spaces. (a) $N$
input units (open circles) are connected directly to a single output unit (solid circle). A binary input pattern
$(\xi_1^{\mu}, \xi_2^{\mu}, \ldots, \xi_N^{\mu})$ of length $N$ is transferred through a sign function to a binary output
$\sigma^{\mu}$, i.e., $\sigma^{\mu} = \mathrm{sgn}\left(\sum_{i=1}^{N} J_i \xi_i^{\mu}\right)$. The set of $N$ binary synaptic
weights $\{J_i\}$ is regarded as a solution of the perceptron problem if the output $\sigma^{\mu} = \sigma_0^{\mu}$ for each
of the $P = \alpha N$ input patterns $\mu \in [1, P]$, where $\sigma_0^{\mu}$ is a preset binary value. (b) Mechanism of
multiple random walkings. All configurations (represented by open or solid circles) are solutions that learn the initial
small set of patterns for each random walker (A or B). The arrows indicate the movement of each walker in the weight space
via SWF. The movement favors a small Hamming distance between A and B and a decreasing energy cost. The desired solutions
(represented by solid circles) for the expected larger $\alpha$ are located in the common parts of the weight spaces explored
by both walkers. Here two common parts are shown, but one, or more than two, are also possible.

precise, the learning task is to seek an optimal set of binary synaptic weights $\{J_i\}_{i=1}^{N}$ that maps each of
the random input patterns $\{\xi_i^{\mu}\}$ ($\mu = 1, \ldots, P$) correctly to the desired output $\sigma_0^{\mu}$,
which is assigned a value $\pm 1$ at random. Given the input pattern $\boldsymbol{\xi}^{\mu}$, the actual output
$\sigma^{\mu}$ of the perceptron is $\sigma^{\mu} = \mathrm{sgn}\left(\sum_i J_i \xi_i^{\mu}\right)$, where $J_i$ takes
$\pm 1$ and $\xi_i^{\mu}$ takes $\pm 1$ with equal probability. If $\sigma^{\mu} = \sigma_0^{\mu}$, we say that the
synaptic weight vector $\mathbf{J}$ has learned the $\mu$-th pattern. Therefore we define the number of patterns mapped
incorrectly as the energy cost $E(\mathbf{J}) = \sum_{\mu} \Theta\left(-\sigma_0^{\mu} \sum_i J_i \xi_i^{\mu}\right)$, where
$\Theta(x)$ is a step function with the convention that $\Theta(x) = 0$ if $x < 0$ and $\Theta(x) = 1$ otherwise. In the
current setting, both $\{\xi_i^{\mu}\}$ and the desired outputs $\{\sigma_0^{\mu}\}$ are generated randomly and
independently. Without loss of generality, we assume $\sigma_0^{\mu} = +1$ for any input pattern in the remaining part of
this letter, since one can perform a gauge transformation $\xi_i^{\mu} \to \xi_i^{\mu} \sigma_0^{\mu}$ on each input
pattern without affecting the result.
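
The energy cost and the gauge transformation defined above can be illustrated with a minimal sketch. The function names (`field`, `energy`, `gauge`), the sizes and the random seed are illustrative assumptions and do not come from the paper; only the Python standard library is used.

```python
import random

def field(J, pattern):
    """Stability field sum_i J_i * xi_i for one pattern."""
    return sum(j * x for j, x in zip(J, pattern))

def energy(J, patterns, targets):
    """Number of patterns with Theta(-sigma0 * field) = 1, i.e. not learned."""
    return sum(1 for xi, s in zip(patterns, targets) if -s * field(J, xi) >= 0)

def gauge(patterns, targets):
    """Map xi_i^mu -> xi_i^mu * sigma0^mu so that every desired output is +1."""
    return [[x * s for x in xi] for xi, s in zip(patterns, targets)]

rng = random.Random(0)                  # arbitrary seed, for reproducibility
N, P = 201, 60                          # odd N: the field can never vanish
patterns = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(P)]
targets = [rng.choice((-1, 1)) for _ in range(P)]
J = [rng.choice((-1, 1)) for _ in range(N)]

# The energy is unchanged by the gauge transformation.
assert energy(J, patterns, targets) == energy(J, gauge(patterns, targets), [1] * P)
```

Because the energy depends on the inputs only through the products $\sigma_0^{\mu} \xi_i^{\mu}$, the two evaluations agree for any weight vector, which is why the desired outputs can be fixed to $+1$.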

Before introducing the combined learning strategy, we first briefly outline two simple local search strategies [10], i.e.,
single-weight flip (SWF) and double-weight flip (DWF). To learn a given set of random patterns, we first generate an
initial weight configuration $(J_1^0, J_2^0, \ldots, J_N^0)$ at time $t = 0$. The first pattern is then presented to the
perceptron. If this pattern is correctly learned by the initial weight configuration, then the second pattern is
presented; otherwise the weight configuration is modified by a sequence of SWF or DWF until this pattern is correctly
classified. All patterns are applied in sequential order. Suppose that at time $t$ the weight configuration is
$\mathbf{J}^t = (J_1^t, J_2^t, \ldots, J_N^t)$ and that this configuration correctly classifies the first $m$ input
patterns $\boldsymbol{\xi}^{\mu}$ ($\mu = 1, \ldots, m$) but not the $(m+1)$-th pattern $\boldsymbol{\xi}^{m+1}$. The
random walker keeps wandering in the weight space of the first $m$ patterns via SWF or DWF until a configuration that
correctly classifies $\boldsymbol{\xi}^{m+1}$ is reached. In the SWF protocol, a set $A(t)$ of allowed single-weight flips
is constructed based on the current configuration $\mathbf{J}^t$ and the $m$ learned patterns. $A(t)$ contains all integer
indexes $j \in [1, N]$ with the property that the single-weight flip $J_j^t \to -J_j^t$ does not cause any barely learned
pattern $\mu \in [1, m]$ (whose stability field $h^{\mu} \equiv \sum_i J_i^t \xi_i^{\mu} = +1$) to be misclassified. At time
$t' = t + 1/N$, an index $j$ is chosen uniformly at random from the set $A(t)$ and the weight configuration is changed to
$\mathbf{J}^{t'}$ such that $J_i^{t'} = J_i^t$ if $i \neq j$ and $J_j^{t'} = -J_j^t$. The
DWF protocol is very similar to the SWF protocol, with the only difference that the set $A(t)$ contains pairs of integer
indexes $(i, j)$ for allowed double-weight flips. This set can be constructed as follows. For the current configuration
$\mathbf{J}^t$, if there are no barely learned patterns ($h^{\mu} = +1$ or $+3$ for double-weight flips) among the first
$m$ learned patterns, $A(t)$ includes all the $N(N-1)/2$ pairs of integers $(i, j)$ with $1 \le i < j \le N$. Otherwise,
randomly choose a barely learned pattern, say $\mu_1 \in [1, m]$, and for each integer $i \in [1, N]$ with the property
$J_i^t \xi_i^{\mu_1} < 0$, put $i$ into another set $B(t)$, then do the following: (1) if $J_i^t \xi_i^{\mu} < 0$ for all
the other barely learned patterns, then add all the pairs $(i, j)$ with $j \notin B(t)$ into the set $A(t)$; (2) otherwise,
add to the set $A(t)$ all the pairs $(i, j)$ with the property that the integer $j \notin B(t)$ satisfies
$J_j^t \xi_j^{\mu} < 0$ for all those barely learned patterns $\mu \in [1, m]$ with $J_i^t \xi_i^{\mu} > 0$. In practice,
when the number of added patterns is small, we use an alternative scheme in which the set $A(t)$ is not pre-constructed;
instead we randomly select a pair of weights and flip them if the flip would not cause any previously learned pattern to be

misclassified. Once the number of flippable pairs of weights is substantially reduced, we switch to the DWF protocol
described above to keep the learning going. In this way, learning a relatively large set of patterns is not
very time consuming.
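
The SWF part of the sequential learning rule can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `allowed_single_flips` and `learn_sequentially_swf`, the flip budget `max_flips` and the demo sizes are assumptions, all desired outputs are taken to be $+1$ (after the gauge transformation), and a pattern counts as barely learned when one flip could push its field to zero or below.

```python
import random

def allowed_single_flips(J, learned):
    """Indices j whose flip keeps every learned pattern (target +1) learned.

    A flip changes a field by +/-2, so only patterns with field <= 2 are
    at risk; for those, j is allowed only if J_j * xi_j < 0 (the flip
    then raises the field instead of lowering it).
    """
    barely = [xi for xi in learned if sum(j * x for j, x in zip(J, xi)) <= 2]
    return [j for j in range(len(J)) if all(J[j] * xi[j] < 0 for xi in barely)]

def learn_sequentially_swf(patterns, N, rng, max_flips=100000):
    """Present patterns one by one; random-walk by single-weight flips until
    the newest pattern is learned. Returns (J, number of patterns learned)."""
    J = [rng.choice((-1, 1)) for _ in range(N)]
    flips = 0
    for m, new in enumerate(patterns):
        while sum(j * x for j, x in zip(J, new)) <= 0:
            A = allowed_single_flips(J, patterns[:m])
            if not A or flips >= max_flips:   # frozen walker or budget spent
                return J, m
            j = rng.choice(A)                 # uniform choice from A(t)
            J[j] = -J[j]
            flips += 1
    return J, len(patterns)

rng = random.Random(1)                        # arbitrary seed
N = 21
patterns = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(6)]
J, m = learn_sequentially_swf(patterns, N, rng)
```

Whatever value `m` takes, the first `m` patterns are guaranteed to be learned by the returned `J`, because every accepted flip is drawn from the allowed set. A DWF version would replace the index list by a list of allowed pairs built with the rules (1) and (2) above.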

The combined strategy to improve the learning performance of a single walker is illustrated in Fig. 1(b). Before
the learning starts, we divide the given set of patterns to be learned into three parts, namely $A$, $B$ and $C$, with
the property that the number of patterns in $A \cup B$ equals that in $A \cup C$. Then the first walker tries to learn
$A \cup B$ by the DWF protocol while the second walker tries to learn $A \cup C$ by the same protocol. After $A \cup B$
and $A \cup C$ have been learned by the two walkers separately, each walker keeps wandering in its current weight space
via SWF, with the property that all previously learned patterns remain learned and no new pattern is added. In addition,
the walkers communicate with each other by lowering the sum of the Hamming distance and the associated energy costs.
If the sum goes up, we accept the walking (both walkers modify their current configurations via SWF one time)
with probability $e^{-\beta N(\Delta H_d + \Delta\epsilon)}$, where $\Delta H_d$ is the change of the Hamming distance
caused by the walking and $\Delta\epsilon$ the change of the energy cost density. $\beta$ serves as a control parameter
to be optimized. One could also introduce another inverse temperature $\gamma$ (see the partition function below) to
control the decreasing rate of the energy cost [20]-[23]. For simplicity, we set $\gamma = \beta$ in our simulations
although they can be changed independently. The Hamming distance is
$H_d = \frac{1}{2}\left(1 - \frac{1}{N}\sum_i J_i^{(1)} J_i^{(2)}\right)$, where $\mathbf{J}^{(1)}$ is the current weight
configuration of the first walker and $\mathbf{J}^{(2)}$ that of the second walker. For the first walker, the energy
cost is the number of patterns in $C$ misclassified by $\mathbf{J}^{(1)}$, and the energy cost for the second walker is
the number of patterns in $B$ misclassified by $\mathbf{J}^{(2)}$. Once either of these two energy costs becomes zero,
the whole learning process is terminated and the whole set of patterns is learned. Otherwise, the learning process stops
when the maximal number of attempts for multiple random walkings, namely $T_{\max}$, is reached. $T_{\max}$ is a free
parameter whose value should be chosen by considering the trade-off between efficiency (a solution is found) and
computational cost.
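
A minimal sketch of the cooperative stage is given below. It rests on several assumptions that are not stated in the source: the acceptance exponent is taken as $\beta$ times the change in the number of differing weights plus the change in the total number of misclassified patterns of both walkers (one reading of the Metropolis-like rule, with both densities normalized by $N$); the search gives up as soon as one walker has no allowed single-weight flip; and the initial configurations in the demo are obtained by rejection sampling rather than by DWF. All function and variable names are hypothetical.

```python
import math
import random

def field(J, xi):
    """Stability field sum_i J_i * xi_i of one pattern (target output +1)."""
    return sum(j * x for j, x in zip(J, xi))

def n_errors(J, patterns):
    """Energy cost: number of patterns not learned by J."""
    return sum(1 for xi in patterns if field(J, xi) <= 0)

def allowed_flips(J, learned):
    """Single-weight flips that keep every pattern in `learned` learned."""
    barely = [xi for xi in learned if field(J, xi) <= 2]
    return [j for j in range(len(J)) if all(J[j] * xi[j] < 0 for xi in barely)]

def total_cost(J1, J2, B, C):
    """N*H_d (number of differing weights) plus both walkers' energy costs."""
    differing = sum(1 for a, b in zip(J1, J2) if a != b)
    return differing + n_errors(J1, C) + n_errors(J2, B)

def combined_walk(J1, J2, A, B, C, beta, t_max, rng):
    """J1 must already learn A+B and J2 must learn A+C (lists of patterns).

    Returns a weight list that learns A+B+C, or None if none is found.
    """
    cost = total_cost(J1, J2, B, C)
    for _ in range(t_max):
        if n_errors(J1, C) == 0:
            return J1
        if n_errors(J2, B) == 0:
            return J2
        A1, A2 = allowed_flips(J1, A + B), allowed_flips(J2, A + C)
        if not A1 or not A2:
            return None                        # a walker is frozen under SWF
        j1, j2 = rng.choice(A1), rng.choice(A2)
        J1[j1], J2[j2] = -J1[j1], -J2[j2]      # each walker flips one weight
        new_cost = total_cost(J1, J2, B, C)
        delta = new_cost - cost
        if delta <= 0 or rng.random() < math.exp(-beta * delta):
            cost = new_cost                    # accept the joint move
        else:
            J1[j1], J2[j2] = -J1[j1], -J2[j2]  # reject: undo both flips
    if n_errors(J1, C) == 0:
        return J1
    return J2 if n_errors(J2, B) == 0 else None

rng = random.Random(3)                         # arbitrary seed
N = 15
A, B, C = ([[rng.choice((-1, 1)) for _ in range(N)]] for _ in range(3))

def sample_solution(patterns):
    """Rejection sampling; only sensible for the tiny demo sizes used here."""
    while True:
        J = [rng.choice((-1, 1)) for _ in range(N)]
        if n_errors(J, patterns) == 0:
            return J

J1, J2 = sample_solution(A + B), sample_solution(A + C)
solution = combined_walk(J1, J2, A, B, C, beta=1.0, t_max=2000, rng=rng)
```

Since every proposed flip is drawn from a walker's own allowed set and rejected moves are undone, each walker keeps its initial patterns learned throughout, so any returned configuration classifies $A \cup B \cup C$ correctly.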

The combined local search strategy through multiple random walkings actually utilizes the smoothness of the weight 
space at the initial low constraint density to achieve the solution at expected high constraint density. In this process, 
the Hamming distance between both walkers and their associated energy costs play a key role in guiding either (or 
both) of them to the desired solution. To get a preliminary estimate of the optimal inverse temperature $\beta$, we perform
an annealed computation which is able to give us a crude knowledge of the relation between the Hamming distance 
(or energy cost) and the inverse temperature. In our current setting, we can write the partition function as: 
where we still distinguish $\beta$ and $\gamma$, which are set to be equal in our simulations. Note that the added prefactor
$N^{-1/2}$ makes the argument of $\Theta(\cdot)$ of order unity for the sake of the statistical mechanics analysis. In the
annealed approximation, we compute the disorder average of the partition function $\langle Z \rangle$, where
$\langle \cdots \rangle$ denotes the average over the random input patterns. We skip the details of the computation here
and define the overlap between configurations $\mathbf{J}^{(1)}$ and $\mathbf{J}^{(2)}$ as
$q = \frac{1}{N}\sum_i J_i^{(1)} J_i^{(2)}$; then the annealed approximation of the free energy density $f_{\rm ann}$ is given by:
FIG. 2: (Color online) Comparison of learning performances using DWF and multiple random walkings. The pattern length is
$N = 201$. For each sample, we try to learn a fixed set of random unbiased patterns ten times and record the fraction of success.
The error bar indicates the fluctuation across eight random samples. We choose the optimal temperature according to the annealed
estimates, with relatively small predicted $H_d^{\rm ann}$ and $\epsilon_{\rm ann}$. We set $T_{\max} = 5 \times 10^{?} N$ and
the initial constraint density $\alpha_i = 0.4$.



where $H(x) \equiv \int_x^{\infty} Dt$ with the Gaussian measure $Dt \equiv \frac{dt}{\sqrt{2\pi}}\, e^{-t^{2}/2}$, and
$\alpha_c$ is the common constraint density, defined as the ratio of the number of patterns in the set $A$ to $N$; the set
$A$ is the common part learned by both walkers at the initial stage. $\hat{q}$ is the conjugate counterpart of the overlap
$q$, and both of them are determined by the following recursive equations:

$$q = \tanh \hat{q},$$

[second recursive equation, for $\hat{q}$: only the fragment $\sqrt{1 - q^{2}}\,\operatorname{arccot}(\cdot)$ is legible]
After the solution of the above recursive equations is obtained, the annealed typical Hamming distance is calculated
via $H_d^{\rm ann} = (1 - q)/2$, and the annealed energy density is evaluated as:
In practical learning of a single instance, we choose the optimal temperature at which the predicted $H_d^{\rm ann}$ and
$\epsilon_{\rm ann}$ take small values (e.g., around 0.06 and 0.01 respectively); these predicted values can also be
compared with those obtained during the actual learning processes. Note that the learning performance is not very
sensitive to small changes in the temperature as long as the temperature used yields relatively small predicted
$H_d^{\rm ann}$ and $\epsilon_{\rm ann}$. In addition, we define an initial constraint density $\alpha_i$ as the number of
patterns in $A \cup B$ (or $A \cup C$) divided by $N$. Its value should be chosen to be relatively small, such that each
walker can learn its initial pattern set and the common parts of the weight spaces explored by both walkers exist. In
practice, we choose $\alpha_i = 0.4, 0.35, 0.3$ for $N = 201, 501, 1001$ respectively.

We apply the proposed combined local search strategy to learn random patterns of length $N = 201$ and compare the
result with that obtained by DWF. DWF is able to proceed even if SWF does not work, i.e., when all single weights are
frozen but flipping certain pairs of weights is still permitted. Hence DWF can achieve a higher mean $\alpha$ than SWF.
Furthermore, its learning time grows almost linearly with $\alpha$ up to the constraint density where DWF cannot proceed
any more. Therefore we only consider the comparison between the combined local search strategy and DWF. As shown in
Fig. 2, multiple random walkings do outperform DWF despite the large observed fluctuation across different samples at
large expected $\alpha$. In fact, DWF is easily trapped if the current configuration is frozen with respect to
double-weight flips, which occurs with high probability when the constraint density becomes large [13]. It is expected
that the weight space becomes rather rugged at high $\alpha$, in the sense that exponentially many small solution
clusters appear and most of them cannot be connected by simple single-weight or double-weight flips. However, multiple
random walkings start the search for a solution at high $\alpha$ from the weight space at a relatively small
$\alpha_i$, where each walker is expected to probe a big connected component. To reach the desired solution on available
time scales, the Hamming distance between the two walkers and the associated energy costs are needed to guide both
walkers, since the solutions for the
FIG. 3: (Color online) Median learning time $\tau_{\rm med}$ versus constraint density $\alpha$ using multiple random walkings
for different $N$. 16 random pattern sets are generated and the corresponding learning times (walking steps) are ordered;
$\tau_{\rm med}$ is the median of this ordered sequence. Those cases where the learning fails within $T_{\max}$ are put at the
top of the ordered learning time sequence. $T_{\max} = 5 \times 10^{?} N$, $2 \times 10^{?} N$, $10^{?} N$ for
$N = 201, 501, 1001$ respectively. We choose the optimal temperature according to the annealed estimates, with relatively small
predicted $H_d^{\rm ann}$ and $\epsilon_{\rm ann}$. The solid lines are power-law fits of the form
$\tau_{\rm med} \propto (\alpha_{cr} - \alpha)^{-\delta}$, where $\alpha_{cr} \simeq 0.72, 0.575, 0.475$ and
$\delta \simeq 1.374, 1.257, 0.644$ for $N = 201, 501, 1001$ respectively. The dashed-dotted line indicates the mean
constraint density achieved by DWF for $N = 1001$, the dashed line that for $N = 501$ and the dotted line that for $N = 201$ [10].

high $\alpha$ are located in the common parts of the weight spaces explored by both walkers, and these common parts emerge
as independent clusters when the high $\alpha$ is finally reached. As shown in Fig. 2, the combined strategy still has a
finite probability to learn $0.78N$ random input patterns, while DWF is not able to learn patterns with $\alpha > 0.70$.
Even at the small constraint density $\alpha \simeq 0.60$, the fraction of learning success by multiple random walkings
can be nearly 100%, with smaller fluctuations and very few walking steps (see Fig. 4(a)). In Fig. 3 we also show the
median learning time (walking steps) of 16 random pattern sets. This is done by recording the walking steps needed to
learn each pattern set and then ordering the learning times [10]. We define $\tau_{\rm med}$ as the median value of this
ordered sequence. Those cases where the learning fails within $T_{\max}$ are put at the top of the ordered learning time
sequence. As shown in Fig. 3, the critical constraint density $\alpha_{cr}$, at which 50% of the presented pattern sets are
learned successfully, is larger than the mean one achieved by DWF [10]: $\alpha_{cr} \simeq 0.72, 0.575, 0.475$ for
$N = 201, 501, 1001$ respectively, and $\tau_{\rm med}$ grows with $\alpha$ roughly as a power law
$\tau_{\rm med} \propto (\alpha_{cr} - \alpha)^{-\delta}$. As $N$ increases, the critical value $\alpha_{cr}$ decreases,
which is consistent with the fact that the quality of a polynomial algorithm for the binary perceptron decreases with
increasing system size [8], [11]. However, the combined local search strategy does improve the learning performance of a
single random walker.
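
The median learning time with failed runs placed at the top of the ordered sequence can be computed as in the following sketch. The function name, the use of `None` to mark a failed run and the choice of the upper middle entry for an even number of runs are assumptions made for illustration; the source does not specify the last point.

```python
import math

def median_learning_time(times):
    """Median of learning times; `None` marks a run that failed within T_max.

    Failed runs are treated as infinitely slow, which places them at the top
    of the ordered sequence. For an even number of runs this sketch returns
    the upper of the two middle entries.
    """
    ordered = sorted(math.inf if t is None else t for t in times)
    return ordered[len(ordered) // 2]

print(median_learning_time([120, None, 80, 300, None]))  # 300
```

With this convention the median becomes infinite once more than half of the pattern sets fail, which matches the definition of $\alpha_{cr}$ as the density at which 50% of the pattern sets are learned.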

Fig. 4 shows the evolution of the Hamming distance and energy cost for multiple random walkings finding a solution of
a single instance at high constraint density. For $N = 201$, one can see from Fig. 4(a) that a solution for
$\alpha \simeq 0.597$ can be found within 2000 walking steps. In addition, the annealed computation reproduces the plateau
values of the Hamming distance and energy cost with very good agreement. If both walkers are guided only by the Hamming
distance, a solution can also be found for $\alpha \simeq 0.70$ and $N = 201$, but the fraction of success is reduced to
$33.8\% \pm 16.5\%$ with $\beta = 2.6$. As displayed in Fig. 4(b), the evolution of $H_d$ seems to be highly correlated
(almost synchronous) with that of the energy cost, and $H_d^{\rm ann}$ is consistent with the plateau value of the Hamming
distance. Without the guidance of the Hamming distance, both walkers are easily trapped by suboptimal configurations with
a small finite energy. However, guided by both the Hamming distance and the energy, the search for a solution can be sped
up, since the absolute value of $N\Delta H_d$ is usually comparable to that of $\Delta\epsilon$ during the learning
process. Notice that most of the weight configurations in the weight spaces explored by both walkers act as suboptimal
configurations for the learning problem at the expected high constraint density, and there may exist very narrow
corridors to the common part to which the desired solutions belong. If we apply a single walker, guided only by energy, to
explore its weight space after the initial stage to find a solution, we find that it easily gets stuck in local minima of
the energy landscape as well. Fig. 4(c) plots the evolution of the Hamming distance and energy cost during the whole
learning process for a very high constraint density $\alpha \simeq 0.776$, for which a larger number of walking steps is
required to reach the desired solution. For $N = 1001$ and expected $\alpha = 0.50$, it is very difficult to find a
solution through two correlated random walkers if the maximal number of walking attempts is limited. However, we still
found a solution with $T_{\max} = 5 \times 10^{?} N$, and the evolution of $H_d$ and $\epsilon$ is presented in Fig. 4(d).
In Fig. 4(c) and (d), our annealed estimates of $H_d$ and $\epsilon$ seem to be larger than the actual plateau values;
however, they still provide the information needed to select the optimal control parameter $\beta$. To predict the actual
plateau values of $H_d$ and $\epsilon$ more accurately, a quenched computation within the replica symmetric approximation
or the one-step replica symmetry breaking approximation is needed, which we leave for future work.

In conclusion, we apply two correlated random walkers instead of a single walker to improve the learning performance
of the binary perceptron. The solution for small $\alpha_i$ can be easily obtained by SWF or DWF. Both walkers
FIG. 4: (Color online) Evolution of the Hamming distance and energy cost for multiple random walkings finding a solution of a
single instance at high constraint density. The evolution corresponds to the walker that finds the desired solution. (a)
$N = 201$, $\alpha = 0.597$. The dashed-dotted line stands for the annealed Hamming distance and the dotted line for the
annealed energy cost. The temperature is chosen according to the annealed estimates, with relatively small predicted
$H_d^{\rm ann}$ and $\epsilon_{\rm ann}$. $\alpha_i = 0.4$, $\beta = 1.8$, $T_{\max} = 5 \times 10^{?} N$. (b) The same as
(a), but both walkers are guided only by the Hamming distance and the temperature is chosen with relatively small predicted
$H_d^{\rm ann}$; $\alpha = 0.697$, $\beta = 2.6$. (c) The same as (a), but for $\alpha = 0.776$, $\beta = 1.7$. (d)
$N = 1001$, $\alpha = 0.50$, $\alpha_i = 0.3$, $\beta = 1.9$, $T_{\max} = 5 \times 10^{?} N$.



then explore their respective weight spaces cooperatively to reach one of the common parts to which the solution for
high $\alpha$ belongs. The smart combined strategy through multiple random walkings makes the whole weight space at the
expected high constraint density ergodic for both walkers, so that learning in the hardest phase becomes possible.
The efficiency of our method depends on the choice of a suitable temperature to help the walkers overcome energy
or entropic barriers, and it also relies on the smoothness of the initial weight space at $\alpha_i$, whose value should
ensure that the common parts of the weight spaces explored by both walkers exist. To this end, we derive annealed
estimates of the Hamming distance and the energy density to select a suitable temperature with small predicted
$H_d^{\rm ann}$ and $\epsilon_{\rm ann}$. Interestingly, the Hamming distance is found to be important for guiding the
correlated walkers to a solution (see Fig. 4(b)). However, as $N$ (and the expected $\alpha$) increases (see Fig. 3),
the learning time needed to reach the desired solution grows rapidly, or a much larger $T_{\max}$ should be preset. This
supports the computational difficulty of finding solutions for the binary perceptron by means of local search heuristics
[1], [13], [8]. Future research is needed to acquire a full understanding of this point.